Search Results: "Erich Schubert"

21 May 2013

Erich Schubert: Google Hangouts drops XMPP support

Update: today I've been receiving XMPP messages in the Google+ variant of Hangouts again. It looks as if it is currently back (at least while you are logged in via XMPP - haven't tried without Pidgin running at the same time yet). Let's just hope that XMPP federation will continue to be supported in the long run.
It's been all over the internet, so you probably heard it already: Google Hangouts no longer receives messages from XMPP users. Before, you could easily chat with "federated" users from other Jabber servers.
While of course the various open-source people are not amused -- for me, most of my contacts disappeared, so I uninstalled Hangouts to get Google Talk back (apparently this works if Talk was preinstalled in your phone's firmware) -- this bears some larger risks for Google:
  • Reputation: Google used to have the reputation of being open. XMPP support was open; the current "Hangups" protocol is not. This continuing trend of abandoning open standards and moving to "walled garden" solutions will likely harm the company's reputation in the open source community.
  • Legal risk of an antitrust action: Before, other competitors could interface with Google using an independent and widely accepted standard. An example is United Internet in Germany, which operates, among others, the Web.de and GMX platforms, mail.com, and the 1&1 internet provider. Effectively locking out these competitors - without an obvious technical reason, as XMPP was working fine just before, and apparently continues to be used at Google, for example in AppEngine - bears a high risk of running into an antitrust action in Europe. If I were 1&1, I would try to get my lawyers started... or if I were Microsoft, who apparently just wanted to add XMPP messaging to Hotmail?
  • Users: Google+ is not that big yet, especially in Germany. Since 90% of my contacts were XMPP contacts, where am I likely going to move: to Hangouts, or to another XMPP server? Or back to Skype? I still use Skype for voice calls more than Google (which I used maybe twice), because there are some people who prefer Skype. One of these calls probably was not even using the Google plugin, but an open source phone - because with XMPP and Jingle, my regular chat client would interoperate. And in fact, the reason I started using Google Talk in the first place was that it would interoperate with other networks, too, and I assumed they would be good at operating a Jabber server.
In my opinion, Google needs to quickly restore a functioning XMPP bridge. It is okay if they offer add-on functionality only for Hangouts users (XMPP was always designed to allow for add-on functionality); it is also okay if they propose an entirely new open protocol to migrate to in the long run, if they can show good reasons such as scalability issues. But the way they approached the "Hangups" rollout looks like a big #fail to me.
Oh, and there are other issues, too. For example, Linus Torvalds complains about the fonts being screwed up (not hinted properly) in the new Google+, others complain about broken presence indicators (but then you might as well just send an email, if you can't tell whether the recipient will be able to receive and answer right away), and using Hangouts will apparently also (for now -- rumor has it that Voice will also be replaced by Hangups entirely) lose you Google Voice support. The only thing that seems to get positive press are the easter eggs...
All in all, I'm not surprised to see over 20% of users giving the lowest rating in the Google Play Store, and less than 45% giving the highest rating - for a Google product, this must be really low.

28 February 2013

Erich Schubert: ELKI data mining in 2013

ELKI, the data mining framework I use for all my research, is coming along nicely, and will see continued progress in 2013. The next release is scheduled for SIGMOD 2013, where we will be presenting the novel 3D parallel coordinates visualization we recently developed. This release will bear the version number 0.6.0.
Version 0.5.5 of ELKI has been in Debian unstable since December (version 0.5.0 will be in the next stable release) and in Ubuntu raring. The packaged installation can share its dependencies with other Debian packages, so it is smaller than the download from the ELKI web site.
If you are developing cluster analysis or outlier detection algorithms, I would love to see them contributed to ELKI. If I get clean and well-integrated code by mid-June, your algorithm could be included in the next release, too. Publishing your algorithms as source code in a larger framework such as ELKI will often give you more citations, because it is then easier to compare against your algorithm and to try it on new problems. And, well, citation counts are a measure that administrations love to use when judging researchers ...
So what else is happening with ELKI:
  • The new book "Outlier Analysis" by C. C. Aggarwal mentions ELKI for visual evaluation of outlier results as well as in the "Resources for the Practitioner" section, and cites around 10 publications closely related to ELKI.
  • Some classes for color feature extraction of ELKI have been contributed to jFeatureLib, a Java library for feature detection in image data.
  • I'd love to participate in the Google Summer of Code, but I need a contact at Google to "vouch" for the project, otherwise it is hard to get in. I've been sending a couple of emails, but so far have not heard back much yet.
  • As the performance of SVG/Batik is not too good, I'd like to see more OpenGL based visualizations. This could also lead to an Android based version for use on tablets.
  • As I'm not a UI guy, I would love to have someone build a fancier UI that still exposes all the rich functionality we have. The current UI is essentially an automatically generated command line builder - which is nice, as new functionality shows up without the need to modify UI code. It's good for experienced users like me, but hard for beginners to get started with.
  • I'd love to see integration of ELKI with e.g. OpenRefine / Google Refine to make it easier to do appropriate data cleaning and preprocessing.
  • There is work underway for a distributed version running on Hadoop/YARN.

3 December 2012

Erich Schubert: ResearchGate Spam

Update Dec 2012: ResearchGate still keeps on sending me their spam. Most of the colleagues of mine who tried out RG have apparently deleted their accounts there by now, so the invitation mails are becoming fewer. Please do not try to push this link on Wikipedia just because you are also annoyed by their emails. My blog is not a "reliable source" by Wikipedia standards. It solely reflects my personal view of that web site, not journalistic or scientific research. The reason why I call ResearchGate spam is the weasel words they use to trick authors into sending the invitation spam. Here's the text that comes with the checkbox you need to uncheck (from the ResearchGate "blog"):
Add my co-authors that are already using ResearchGate as contacts and invite those who are not yet members.
See how it is worded so it sounds much more like "link my colleagues that are already on ResearchGate" instead of "send invitation emails to my colleagues"? It deliberately avoids mentioning "email", too. And according to the ResearchGate news post, this is hidden in "Edit Settings", too (I never bothered to try it -- I do not see any benefit to me in their offers, so why should I?). Original post below:
If you are in science, you probably already received a couple of copies of the ResearchGate spam. They are trying to build a "Facebook for scientists", and so far, their main strategy seems to be aggressive invitation spam. So far, I've received around 5 of their "invitations", which essentially sound like "Claim your papers now!" (without actually getting any benefit). When I asked my colleagues about these invitations, none of them actually meant to invite me! This is why I consider this behaviour of ResearchGate to be spam. Plus, at least one of these messages was a reminder, not triggered by user interaction.
Right now, they claim to have 1.9 million users. They also claim "20% interact at least once a month". However, they have around 4000 Twitter followers and Facebook fans, and the top topics on their network are at like 10000-50000 users. That is probably a much more realistic user count estimate: 4k-40k. And these "20%" that interact might just be the 20% the site grew by in this timeframe who happened to click on the sign-up link. For a "social networking" site, these numbers are pointless anyway - probably even less than MySpace. Because I do not see any benefit in their offers! Before going on an extremely aggressive marketing campaign like this, they really should consider actually having something to offer... And the science community is a lot about not wasting their time.
It is a dangerous game that ResearchGate is playing here. It may appeal to their techies and investors to artificially inflate their user numbers into the millions. But if you pay for the user numbers with your reputation, that is a bad deal! Once you have the reputation of being a spammer (and mind you, every scientist I've talked to so far complained about the spam and said "I clicked on it only to make it stop sending me emails"), it's hard to be taken seriously again. The scientific community runs a lot on reputation, and ResearchGate is screwing up badly on this. In particular, according to the ResearchGate founder on Quora, the invitations are opt-out when "claiming" a paper. Sorry, this is wrong. Don't make users annoy other users by sending them unwanted invitations to a worthless service!
And after all, there are alternatives such as Academia and Mendeley that do offer much more benefit. (I do not use these either, though. In my opinion, they also do not offer enough benefit to bother going to their website. I've mentioned the inaccuracy of Mendeley's data - and the lack of an option to get it corrected - before in an earlier blog post. Don't rely on Mendeley as a citation manager! Their citation data is unreviewed.)
I'm considering sending ResearchGate (they're Berlin based, but there may also be a US office you could direct this to) a cease and desist letter, denying them the right to store personal information on me and to use my name on their websites to promote their "services". They may have visions of a more connected and more collaborative science, but they actually don't have new solutions. You can't solve everything by creating yet another web forum and "web2.0izing" everything. Although many of the web 2.0 bubble boys don't want to hear it: you won't solve world hunger and AIDS by doing another website. And there is a life outside the web.

21 November 2012

Axel Beckert: Suggestions for the GNOME Team

Thanks to Erich Schubert's blog posting on Planet Debian I became aware of the 2012 GNOME User Survey at Phoronix. Like back in 2006, I still use some GNOME applications, so I do consider myself a GNOME user in the widest sense and hence I filled out that survey. Additionally I have to live with GNOME 3 as a system administrator of workstations, and that's some kind of usage, too. ;-) The last question in the survey was "Do you have any comments or suggestions for the GNOME team?" Sure I have. And since I tried to give constructive feedback instead of only ranting, here's my answer to that question as I submitted it in the survey, too, just spiced up with some hyperlinks and highlighting:
Don't try to change the users. Give the users more possibilities to change GNOME if they don't agree with your own preferences and decisions. (The trend to castrate the user was already starting with GNOME 2, and GNOME 3 made that worse IMHO.) If you really think that you need less configurability because some non-power-users are confused or challenged by too many choices, then please give the other users at least the chance to enable more configuration options. A very good example in that regard was Kazehakase (RIP), which offered several user interfaces (novice, intermediate and power user or such). The popular text-mode web browser Lynx does the same, too, btw.
GNOME lost me mostly with the change to GNOME 2. The switch from Galeon 1.2 to 1.3/2.0 was horrible, and the later switch to Epiphany made things even worse on the browser side. My short trip to GNOME as desktop environment ended with moving back to FVWM (configurable without tons of clicking, especially after moving to some other computer), and for the browser I moved on to Kazehakase back then. Nowadays I'm living very well with Awesome and Ratpoison as window managers, Conkeror as web browser (which are all very configurable) and a few selected GNOME applications like Liferea (luckily still quite configurable, even though I miss Gecko's about:config since the switch to WebKit), GUCharmap and Gnumeric.
For people switching from Windows I nowadays recommend XFCE, or maybe LXDE on low-end computers. I likely would recommend GNOME 2, too, if it still existed. With regards to MATE I'm skeptical about its persistence and future, but I'm glad it exists, as it solves a lot of problems and brings in just a few new ones. Cinnamon as well as SolusOS are based on the current GNOME libraries and are very likely the more persistent projects, but also very likely have the very same multi-head issues we're all barfing about at work with Ubuntu Precise. (Heck, am I glad that I use Awesome at work, too, and all four screens work perfectly as they did with FVWM before.)
Thanks to Dirk Deimeke for his pointer (written in German) to Marcus Moeller's interview with Ikey Doherty (in German, too) about his Debian-/GNOME-based distribution SolusOS.

Erich Schubert: Phoronix GNOME user survey

While not everybody likes Phoronix (common complaints include tabloid journalism), they are doing a GNOME user survey again this year. If you are concerned about Linux on the desktop, you might want to participate; it is not particularly long.
Unfortunately, "the GNOME Foundation still isn't interested in having a user survey", and may again ignore the results; and already last year you could see a lot of articles along the lines of The Survey That GNOME Would Rather Ignore. One more reason to fill it out.

13 November 2012

Erich Schubert: Migrating from GNOME3 to XFCE

I have been a GNOME fan for years. I actually liked the switch from 1.x to 2.x, and at some point switched to 3.x when it became somewhat usable. At some point I even started some small Gnome projects; one was even uploaded to the Gnome repositories. But I didn't have much time for my Linux hobby anymore back then.
However, I am now switching to XFCE. And for all I can tell, I am about the last one to make that switch. Everybody I know hates the new Gnome.
My reason is not emotional. It's simple: I have systems that don't work well with OpenGL, and thus don't work well with Gnome Shell. Until now, I could live fine with "fallback mode" (aka Gnome Classic). It works really well for me, and does exactly what I need. But it has been all over the media: Gnome 3.8 will drop 'fallback' mode.
Now the choice is obvious: instead of switching to Shell, I go to XFCE, which is much closer to the original Gnome experience, and very productivity-oriented.
There are tons of rants on GNOME 3 (for one of the most detailed ones, see Gnome rotting in threes, going through various issues). Something must be very wrong about what they are doing to receive this many sh*tstorms all the time. Every project receives some. I've even received my share of the Gnome 2 storms, when Galeon (an early Gnome browser) made the move and started dropping some of the hard-to-explain and barely used options that would break with every other Mozilla release. And Mozilla embedding was a major pain in those days. Yet for every feature there would be some user somewhere that loved it, and as Debian maintainer of Galeon, I got to see all the complaints (and at the same time was well aware of the bugs caused by the feature overload).
Yet with Gnome 3, things are IMHO a lot different. Gnome 2 was a lot about making existing things more usable, a bit cleaner and more efficient. Gnome 3 seems to be about experimenting with new stuff, which is why it keeps on breaking APIs all the time. For example, theming GTK 3 is constantly broken; most of the themes available just don't work. Similarly for Gnome Shell extensions - most of them work with exactly one version of Gnome Shell (doesn't this indicate the author has abandoned Gnome Shell?).
But the one thing that really stuck out was when I updated my dad's PC. Apart from some glitches, he could not even shut down his PC with Gnome Shell, because you needed to press the Alt key to actually get a shutdown option.
This is indicative of where Gnome is heading: something undefined in between PCs, tablets, media centers and mobile phones. They just decided that users don't need to shut down anymore, so they could as well drop that option.
But the worst thing about the current state of GNOME is: they happily live with it. They don't care that they are losing users by the dozens, because to them, these are just "complainers". Of course there is some truth in "complainers gonna complain and haters gonna hate". But what Gnome is receiving is way above average. At some point, they should listen. 200-post comment chains from dozens of people on LWN are not just your average "complaints". They are an indicator that a key user base is unhappy with the software. In 2010, GNOME 2 had 45% market share in the LinuxQuestions poll, and XFCE had 15%. In 2011, GNOME 3 had 19%, and XFCE jumped to 28%. And I wouldn't be surprised if GNOME 3 Shell (not counting fallback mode) clocked in at less than 10% in 2012 - despite being the default.
Don't get me wrong: there is a lot about Gnome that I really like. But as they decided to drop my preferred UI, I am of course looking for alternatives - in particular, as I can get lots of the Gnome 3 benefits with XFCE. There is a lot in the Gnome ecosystem that I value, and that IMHO is driving Linux forward: NetworkManager, Poppler, PulseAudio, Clutter, just to name a few. Usually, the stuff that is modular is really good. And in fact I have been a happy user of the "fallback" mode, too. Yet the overall "desktop" goals of Gnome 3 are in my opinion targeting the wrong user group. Gnome might need to target Linux developers more again, to keep a healthy development community around. Frequently triggering sh*tstorms by high-profile people such as Linus Torvalds is not going to strengthen the community. There is nothing wrong with the FL/OSS community encouraging people to use XFCE. But these are developers that Gnome might need at some point.
On a backend / technical level (away from the Shell/UI stuff that most of the rants are about), my main concern about the Gnome future is GTK 3. GTK 2 was a good toolkit for cross-platform development. GTK 3 as of now is not, but is largely a Linux/Unix-only toolkit - in particular, because there apparently is no up-to-date Win32 port. Around GTK 3.4 it was said that a Windows port was being worked on - but as of GTK 3.6, it is still nowhere to be found. So if you want to develop cross-platform, as of now, you'd better stay away from GTK 3. If this doesn't change soon, GTK might sooner or later lose the API battle to more portable libraries.

2 November 2012

Erich Schubert: DBSCAN and OPTICS clustering

DBSCAN [wikipedia] and OPTICS [wikipedia] are two of the most well-known density based clustering algorithms. You can read more on them in the Wikipedia articles linked above.
An interesting property of density based clustering is that these algorithms do not assume clusters to have a particular shape. Furthermore, the algorithms allow "noise" objects that do not belong to any of the clusters. K-means, for example, partitions the data space into Voronoi cells (some people claim it produces spherical clusters - that is incorrect). See Wikipedia for the true shape of K-means clusters and an example that cannot be clustered by K-means. Internal measures for cluster evaluation also usually assume the clusters to be well-separated spheres (and do not allow noise/outlier objects) - not surprisingly, as we tend to experiment with artificial data generated from a number of Gaussian distributions.
The key parameter to DBSCAN and OPTICS is the "minPts" parameter. It roughly controls the minimum size of a cluster. If you set it too low, everything will become clusters (OPTICS with minPts=2 degenerates to a type of single link clustering). If you set it too high, at some point there won't be any clusters anymore, only noise. However, the parameter usually is not hard to choose. If you for example expect clusters to typically have 100 objects, I'd start with a value of 10 or 20. If your clusters are expected to have 10000 objects, then maybe start experimenting with 500.
The more difficult parameter for DBSCAN is the radius. In some cases, it will be very obvious. Say you are clustering users on a map. Then you might know that a good radius is 1 km. Or 10 km. Whatever makes sense for your particular application. In other cases, the parameter will not be obvious, or you might need multiple values. That is when OPTICS comes into play.
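If you prefer code over prose, here is a minimal brute-force DBSCAN sketch in plain Java (Euclidean distance, plain arrays instead of ELKI's data structures, and no index support), just to show where minPts and the radius enter the algorithm - it is only a sketch, not the optimized ELKI implementation linked further below:
import java.util.ArrayDeque;

public class NaiveDBSCAN {
  // labels[i] == 0: unvisited, -1: noise, > 0: cluster id
  public static int[] cluster(double[][] data, double radius, int minPts) {
    int[] labels = new int[data.length];
    int clusterId = 0;
    for (int i = 0; i < data.length; i++) {
      if (labels[i] != 0) {
        continue; // already assigned to a cluster or marked as noise
      }
      int[] neighbors = rangeQuery(data, i, radius);
      if (neighbors.length < minPts) {
        labels[i] = -1; // not dense enough: noise (may later become a border point)
        continue;
      }
      clusterId++;
      labels[i] = clusterId;
      ArrayDeque<Integer> seeds = new ArrayDeque<>();
      for (int n : neighbors) {
        seeds.add(n);
      }
      while (!seeds.isEmpty()) { // expand the cluster
        int q = seeds.poll();
        if (labels[q] == -1) {
          labels[q] = clusterId; // former noise becomes a border point
        }
        if (labels[q] != 0) {
          continue; // already processed
        }
        labels[q] = clusterId;
        int[] qNeighbors = rangeQuery(data, q, radius);
        if (qNeighbors.length >= minPts) { // q is a core point: expand further
          for (int n : qNeighbors) {
            if (labels[n] <= 0) {
              seeds.add(n);
            }
          }
        }
      }
    }
    return labels;
  }

  // A linear scan per query - this is exactly what index structures speed up.
  private static int[] rangeQuery(double[][] data, int i, double radius) {
    return java.util.stream.IntStream.range(0, data.length)
        .filter(j -> distance(data[i], data[j]) <= radius).toArray();
  }

  private static double distance(double[] a, double[] b) {
    double sum = 0;
    for (int d = 0; d < a.length; d++) {
      double diff = a[d] - b[d];
      sum += diff * diff;
    }
    return Math.sqrt(sum);
  }
}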
OPTICS is based on a very clever idea: instead of fixing minPts and the radius, we only fix minPts, and plot the radius at which an object would be considered dense by DBSCAN. In order to sort the objects in this plot, we process them with a priority heap, so that nearby objects are nearby in the plot. This image on Wikipedia shows an example of such a plot.
OPTICS comes at a cost compared to DBSCAN. Largely because of the priority heap, but also as the nearest neighbor queries are more complicated than the radius queries of DBSCAN. So it will be slower, but you no longer need to set the parameter epsilon. However, OPTICS won't produce a strict partitioning. Primarily it produces this plot, and in many situations you will actually want to visually inspect the plot. There are some methods to extract a hierarchical partitioning out of this plot, based on detecting "steep" areas.
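For illustration, here is a matching brute-force OPTICS sketch in the same plain-Java style (again a sketch, not ELKI's implementation): it emits the cluster ordering and the reachability distances that make up the plot, using java.util.PriorityQueue as a crude stand-in for a proper updatable heap; extracting clusters from the plot is omitted.
import java.util.Arrays;
import java.util.Comparator;
import java.util.PriorityQueue;

public class NaiveOPTICS {
  // order and reachOut must have length data.length; reachOut holds the plot values.
  public static void run(double[][] data, double maxRadius, int minPts,
                         int[] order, double[] reachOut) {
    int n = data.length;
    boolean[] processed = new boolean[n];
    double[] reach = new double[n];
    Arrays.fill(reach, Double.POSITIVE_INFINITY); // "undefined" reachability
    // Seed list ordered by current reachability distance - the "priority heap".
    PriorityQueue<Integer> seeds =
        new PriorityQueue<>(Comparator.comparingDouble((Integer i) -> reach[i]));
    int pos = 0;
    for (int start = 0; start < n; start++) {
      if (processed[start]) continue;
      pos = step(data, start, maxRadius, minPts, processed, reach, seeds, order, reachOut, pos);
      while (!seeds.isEmpty()) {
        int next = seeds.poll();
        if (processed[next]) continue;
        pos = step(data, next, maxRadius, minPts, processed, reach, seeds, order, reachOut, pos);
      }
    }
  }

  // Emit one point to the ordering; if it is a core point, update its neighbors' reachability.
  private static int step(double[][] data, int p, double maxRadius, int minPts,
                          boolean[] processed, double[] reach, PriorityQueue<Integer> seeds,
                          int[] order, double[] reachOut, int pos) {
    processed[p] = true;
    order[pos] = p;
    reachOut[pos] = reach[p];
    pos++;
    int[] nbrs = rangeQuery(data, p, maxRadius);
    if (nbrs.length < minPts) return pos; // not a core point, nothing to expand
    double[] dists = new double[nbrs.length];
    for (int i = 0; i < nbrs.length; i++) dists[i] = distance(data[p], data[nbrs[i]]);
    double[] sorted = dists.clone();
    Arrays.sort(sorted);
    double coreDist = sorted[minPts - 1]; // distance to the minPts-th neighbor
    for (int i = 0; i < nbrs.length; i++) {
      int o = nbrs[i];
      if (processed[o]) continue;
      double newReach = Math.max(coreDist, dists[i]);
      if (newReach < reach[o]) {   // found a better reachability for o
        seeds.remove(o);           // O(n) here; ELKI uses a real updatable heap
        reach[o] = newReach;
        seeds.add(o);
      }
    }
    return pos;
  }

  private static int[] rangeQuery(double[][] data, int p, double radius) {
    return java.util.stream.IntStream.range(0, data.length)
        .filter(j -> distance(data[p], data[j]) <= radius).toArray();
  }

  private static double distance(double[] a, double[] b) {
    double sum = 0;
    for (int d = 0; d < a.length; d++) {
      double diff = a[d] - b[d];
      sum += diff * diff;
    }
    return Math.sqrt(sum);
  }
}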
The open source ELKI data mining framework (package "elki" in Debian and Ubuntu) has a very fast and flexible implementation of both algorithms. I've benchmarked this against GNU R (the "fpc" package) and Weka, and the difference is enormous. ELKI without index support runs in roughly 11 minutes; with an index it goes down to 2 minutes for DBSCAN and 3 minutes for OPTICS. Weka takes 11 hours, and GNU R/fpc takes 100 minutes (DBSCAN only, no OPTICS available). And the implementation of OPTICS in Weka is not even complete (it does not support proper cluster extraction from the plot). Many of the other OPTICS implementations you can find with Google (e.g. in Python or MATLAB) seem to be based on this Weka version ...
ELKI is open source. So if you want to peek at the code, here are direct links: DBSCAN.java, OPTICS.java.
Some parts of the code may be a bit confusing at first. The "Parameterizer" classes serve the purpose of allowing automatic UI generation, for example. So there is quite a bit of meta code involved.
Plus, ELKI is quite extensively optimized. For example, it does not use the Java Collections API much anymore. Java Iterators, for example, require returning an object from next(). The C++-style iterators used by ELKI can expose multiple values, including primitive values.
for(DBIDIter id = relation.iterDBIDs(); id.valid(); id.advance())
is a typical for loop in ELKI, iterating over all objects of a relation, yet the whole loop requires creating (and garbage collecting) only a single object. And actually, this is as literal as a for loop can get.
ModifiableDBIDs processedIDs = DBIDUtil.newHashSet(size);
is another example. Essentially, this is like a HashSet<DBID>, except that it is a lot faster, because the object IDs do not need to live as Java objects, but can internally be stored more efficiently (the only currently available implementation of the DBID layer uses primitive integers).
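Put together, a typical processing loop in ELKI then looks roughly like the following. This is a sketch only: imports are omitted, and I am assuming the usual add() and contains() operations on ModifiableDBIDs (as used throughout the ELKI code base) plus a size() method on the relation.
// Sketch combining the two idioms above; ELKI imports omitted.
// Assumes ModifiableDBIDs offers add()/contains() and the relation offers size().
ModifiableDBIDs processedIDs = DBIDUtil.newHashSet(relation.size());
for (DBIDIter id = relation.iterDBIDs(); id.valid(); id.advance()) {
  if (processedIDs.contains(id)) {
    continue; // already handled, e.g. expanded as part of an earlier cluster
  }
  processedIDs.add(id);
  // ... the actual work on the object referenced by this DBID goes here ...
}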
Java advocates always accuse you of premature optimization when you avoid creating objects for primitives. Yet in all my benchmarking, I have consistently seen the number of allocated objects have a major impact - at least when the allocation happens inside a heavily used loop. Java collections with boxed primitives just eat a lot of memory, and the memory management overhead often makes a huge difference. This is why libraries such as Trove (which ELKI uses a lot) exist: because memory usage does make a difference.
(Avoiding boxing/unboxing systematically in ELKI yielded approximately a 4x speedup. But obviously, ELKI involves a lot of numerical computations.)
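To illustrate the boxing overhead outside of ELKI - a self-contained toy example, not ELKI code - compare collecting a few million small integer IDs in a boxed java.util.HashSet with a primitive bitset: the boxed set allocates one Integer object plus a hash table node per element, while the bitset is a single long[] array.
import java.util.HashSet;

public class BoxedVsPrimitive {
  public static void main(String[] args) {
    final int n = 2_000_000; // a few million dense IDs, similar to ELKI's DBIDs

    // Boxed variant: one Integer object plus one hash map node per element.
    HashSet<Integer> boxed = new HashSet<>(2 * n);
    for (int i = 0; i < n; i++) {
      boxed.add(i); // autoboxing allocates an object for almost every value
    }

    // Primitive variant: one bit per possible ID, a single long[] allocation.
    long[] bits = new long[(n + 63) / 64];
    for (int i = 0; i < n; i++) {
      bits[i >>> 6] |= 1L << (i & 63); // set bit i
    }

    // Membership tests give the same answer either way.
    int probe = 123_456;
    boolean inBoxed = boxed.contains(probe);
    boolean inBits = (bits[probe >>> 6] & (1L << (probe & 63))) != 0;
    System.out.println(inBoxed + " " + inBits);
  }
}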

22 October 2012

Erich Schubert: Changing Gnome 3 colors

One thing that many people dislike about Gnome 3, in my opinion, is that the authors/maintainers impose a lot of decisions on you. These are in fact not really hard-coded, but I found the documentation on how to change them to be really inaccessible.
Colors, for example. How to customize GTK colors is extremely badly documented, and at the same time, a lot of the themes do not work reliably across different Gnome versions. For example, the unico engine in Debian experimental is currently incompatible with the main GTK version there (and even worse, GTK does not detect this and refuse to load the incompatible engine). A lot of the themes you can get on gnome-look.org, for example, use unico. So it's pretty easy to get stuck with a non-working GTK 3; this really should not happen that easily. (I do not blame the Debian maintainers for not having worked around this with package conflicts yet - it's in experimental after all. But upstream should know when they are breaking APIs!)
For my work on the ELKI data mining framework I do a lot of work in Eclipse. And here GTK3 really is annoying, in particular the default theme. Next to unusable, actually, as code documentation tooltips show up black-on-black.
Recently, Gnome seems to be driven mostly by a mix of design and visual motivations. Gnome Shell is a good example. No classic Linux user I've met likes it; even my dad immediately asked me how to get the classic panels back. It is only the designers that seem to love it. I'm concerned that they are totally off on their audience: they seem to target Mac OS X users instead of Linux users. This is a pity, and probably much more the reason why Gnome so far does not succeed on the desktop: it keeps on forgetting the users it already has. Those users by now seem to move to XFCE and LXDE, because neither the KDE nor the Gnome crowd cares about classic Linux users in their hunt to copy OS X & Co.
Anyway, enough ranting. Here is a simple workaround -- that hopefully is more stable across GTK/Gnome versions than all those themes out there -- that just slightly adjusts the default theme:
$ gsettings set \
org.gnome.desktop.interface gtk-color-scheme '
os_chrome_fg_color:black;
os_chrome_bg_color:#FED;
os_chrome_selected_fg_color:black;
os_chrome_selected_bg_color:#f5a089;
selected_fg_color:white;
selected_bg_color:#c50039;
theme_selected_fg_color:black;
theme_selected_bg_color:#c50039;
tooltip_fg_color:black;
tooltip_bg_color:#FFC;
'
This will turn your panel from a designer-hip black back to a regular grayish work panel. If you work a lot with Eclipse, you'll love the last two options. That part makes the tooltips readable again! Isn't that great? Instead of chasing the latest hipster colors, we now have readable tooltips for developers again, instead of all those fancy-schmancy designer orgasms!
Alternatively, you can use dconf-editor to set and edit the value. The tricky part was to find out which variables to set. The (undocumented?) os_chrome stuff seems to be responsible for the panel. Feel free to change the colors to whatever you prefer!
GTK is quite customizable, and the gsettings mechanism actually is quite nice for this. It just seems to be really badly documented. The Adwaita theme in particular seems to have quite a few hard-coded relationships, also for the colors. And I haven't found a way (without doing a complete new theme) to just reduce padding, for example - in particular, as there probably are a hundred CSS parameters that one would need to override to apply it everywhere (and with the next Gnome, there will again be two dozen more to add?).
The above method just seems to be the best way to tweak the looks - at least the colors, since that is all you can do this way. If you want to customize more, you probably have to do a complete theme. At which point, you probably have to redo it with every new version. And to pick on Miguel de Icaza: the kernel APIs are extremely stable, in particular compared to the mess that Gnome has been across versions. And at every new iteration, they manage to offend a lot of their existing users (and end up looking more and more like Apple - maybe we should build more on what we are good at, instead of copying OS X and .NET?).

9 September 2012

Erich Schubert: Google Plus replacing blogs, not Facebook

When Google launched Google+, a lot of people were very sceptical. Some outright claimed it to be useless. I must admit, it has a number of functions that really rock.
Google Plus is not a Facebook clone. It does not try to mimic Facebook that much. To me, it looks much more like a blog thing: a blog system where everybody has to have a Google account and can then comment (plus, you can restrict access and share only with some people). It also encourages you to share shorter posts. Successful blogs always tried to make their posts "articles". Now the posts themselves are merely comments; but not as crazy short as Twitter (it is not a Twitter clone either), and it does have rich media contents, too.
Those who expect it to replace their Facebook, where the interaction is all about personal stuff, will be somewhat disappointed, because IMHO it encourages the small-talk type of interaction much less.
However, it has won over a couple of pretty high-profile people who share their thoughts and web discoveries with the world. Some of the most active users I follow on Google Plus are Linus Torvalds and Tim O'Reilly (of the publishing house O'Reilly).
Of course I also have a number of friends that share private stuff on Google Plus. But in my opinion the strength of Google Plus is in sharing publicly. Since Google is the king of search, they can feed your friends' shares into your regular search results, and there is also a pretty interesting search within Google Plus itself. The key difference is that with this search, the focus is on what is new. Regular web search is also a lot about searching for old things (where you did not bother to remember the address or bookmark the site - and mind you, today a lot of people even "google for Google" ...). For example, I like the Plus search for data mining because it occasionally has some interesting links in it. A lot of the stuff comes in again and again, but using the "j" and "k" keys, I can quickly scroll through these results to see if there is anything interesting. And there are quite a lot of interesting things I've discovered this way.
Note that this can change anytime. And maybe it is because I'm interested in technology stuff that it works well for me. But say, maybe you are more into HDR photography than me (I think they look unreal, as if someone has done way too much contrast and edge enhancing on the image). But go there, and press "j" a number of times to browse through some HDR shots. That is a pretty neat search function there. And if you come back tomorrow, there will likely be new images!
Facebook tried to clone this functionality. Google+ launched in June 2011, and in September 2011, Facebook added "subscribers". So they realized the need for having "non-friends" that are interested in what you are doing. Yet I don't know anybody actually using it. And the public posts search is much less interesting than that of Google Plus, and the nice keyboard navigation is also missing.
Don't get me wrong, Facebook still has its uses. When I travel, Facebook is great for me to get into contact with locals to go swing dancing. There are a number of events where people only invite you on Facebook (and that is one of the reasons why I've missed a number of events - because I don't use Facebook that much). But mind it, a lot of the stuff that people share on Facebook is also really boring.
And that will actually be the big challenge for Google: keeping the search results interesting. Once you have millions of people there sharing pictures of lolcats - will it still return good results? Or will just about every search give you more lolcats?
And of course, spam. The SEO crowd is just warming up in exploring the benefits of Google Plus. And there are quite some benefits to be gained from connecting web pages to Google Plus, as this will make your search results stick out somehow, or maybe give them that little extra edge over other results. But just like Facebook at some point was heavily spammed when every little shop was setting up its Facebook pages, inviting everyone to all the events and so on - this is bound to happen on Google Plus, too. We'll see how Google then reacts, and how quickly and effectively.

2 September 2012

Erich Schubert: ELKI call for contributions

ELKI is a data mining software project that I have been working on for the last few years as part of my PhD research. It is open source (AGPL-3 licensed) and available as both a Debian package and an Ubuntu package in the official repositories. So a simple aptitude install elki should get you going and give you a menu entry for ELKI. These packages come with the scripts elki (to launch the MiniGUI) and elki-cli (to run from the command line).
The key feature that sets ELKI apart from existing open source tools used in data mining (e.g. Weka and R) is that it has support for index structures to speed up algorithms, and a very modular architecture that allows various combinations of data types, distance functions, index structures and algorithms.
When looking for performance regressions and optimization potential in ELKI, I recently ran some benchmarks on a data set with 110250 images described by 8-dimensional color histograms. This is a decently sized dataset: it takes long enough (usually in the range of 1-10 minutes) to measure true hotspots. When including Weka and R in the comparison I was quite surprised: our k-means implementation runs at the same speed as R's implementation in C (and around twice the speed of the more flexible "flexclust" version). For some of the key algorithms (DBSCAN, OPTICS, LOF) we are an order of magnitude faster than Weka and R, and adding index support speeds up the computation by another factor of 5-10x. In the most extreme case - DBSCAN in Weka vs. DBSCAN with R-tree in ELKI - the speedup was a factor of 330x, or 2 minutes (ELKI) as opposed to 11 hours (Weka).
The reason why I was surprised is that I expected ELKI to perform much worse. It is written in Java (as opposed to R's kmeans, which is in C), uses a very flexible architecture which, for example, does not assume distances to be of type double, and just has a lot of glue code in between. However, the Java HotSpot compiler obviously lives up to its expectations and manages to inline the whole distance computation into k-means, and then compiles it at a level comparable to C. R executes vectorized operations quite fast, but on non-native code, as in the LOF example, it can become quite slow, too. (I would not take Weka as the reference; in particular with DBSCAN and OPTICS there seems to be something seriously broken. Judging from a quick look at it, the OPTICS implementation actually is not even complete, and both implementations copy all data out of Weka into a custom linear database, process it there, then feed the result back into Weka. They should just drop that "extension" altogether. The much newer and more Weka-like LOF module is much more comparable.)
Note that we also have a different focus than Weka. Weka is really popular for machine learning, in particular for classification. In ELKI, we do not have a single classification algorithm, because there is Weka for that. Instead, ELKI focuses on cluster analysis and outlier detection. And ELKI has a lot of algorithms in this domain - I dare say the largest collection. In particular, they are all in the same framework, so they can be easily compared. R does of course have an impressive collection in CRAN, but in the end the packages do not really fit together.
Anyway, ELKI is a cool research project. It keeps on growing; we have a number of students writing extensions as part of their theses. It has been extremely helpful for me in my own research, as I could quickly prototype some algorithms, then try different combinations and reuse my existing evaluation and benchmarking. You need some time to get started (largely because of the modular architecture, Java generics and similar hurdles), but then it is a very powerful research tool.
But there are just many more algorithms, published sometime, somewhere, but barely ever with source code available. We'd love to get all these published algorithms into ELKI, so researchers can try them out. And enhance them. And use them on their actual data. So far, ELKI was mostly used for algorithmic research, but it is starting to move out into the "real" world. More and more people that are not computer scientists are starting to use ELKI to analyze their data, because it has algorithms that no other tools have.
I tried to get ELKI into the Google Summer of Code, but it was not accepted. But I'd really like to see it gain more traction outside the university world. There are a number of cool projects associated with ELKI that I will, unfortunately, not be able to do myself in the next years. If you are a researcher in cluster analysis or outlier detection, consider contributing your algorithms to ELKI. Spend some time optimizing them and adding some documentation. Because, if ELKI keeps on growing and gaining popularity, it will be the future benchmark platform. And this can give you citations, which are somewhat the currency of science these days. Algorithms available in the major toolkits simply do get cited more, because people compare against them. See this list for an overview of work cited by ELKI - scientific work that we reimplemented, at least to some extent, for ELKI.
It is one of the services that we provide with ELKI for researchers: not only the algorithm, but also the appropriate citation.

30 August 2012

Erich Schubert: Finding packages for deinstallation

On my netbook, I try to keep the amount of installed software limited. Aptitude's "automatically installed" markers are very helpful here, since they allow you to differentiate between packages that were pulled in as dependencies and packages that were deliberately installed. I quite often browse through the list of installed packages and recheck those that are not marked with "A".
However, packages that are "suggested" by some other package (but not "required") will be kept even when marked as automatically installed. This is quite sensible: when you deinstall the package that "suggested" them, they will be removed. So this is nice for having optional software removed automatically as well.
However, sometimes you need the core package but not the optional functionality. Aptitude can help you there, too. Here's an aptitude filter I used to find some packages for removal:
!?reverse-depends(~i) ~M !?essential
It will display only packages that no other installed package directly depends on and that are marked as automatically installed (so they are only being kept installed because of a weaker dependency).
Some examples of "suggested but not required" packages:
  • Accessibility extensions of Gnome
  • Spelling dictionaries
  • Optional functionality / extensions
Depending on your requirements, you might want to keep some of these and remove others.

Here is also a filter to find packages that you can put on "automatically installed":
~i !~M ?reverse-depends(~i) !?essential
This will catch "installed but not automatically installed packages that another installed package depends on". Note that you should not blindly set all of these to "automatic" mode. For example, "logrotate" depends on "cron | anacron | fcron". If you have both cron and anacron installed, aptitude will consider anacron to be unnecessary (it is - on a system with 24h uptime). So review this list, see what happens when you set packages to "A", and reconsider your intentions. If it is software you want for sure, leave it on manual.

15 June 2012

Erich Schubert: Dritte Startbahn - 2 gewinnt!

Usually I don't post much about politics. And this is even a highly controversial issue. Please do not feel offended.
This weekend, there is an odd election in Munich. Outside of Munich, near the cities of Freising and Erding, lies the Munich airport. The company operating the airport is owned partially by the city of Munich, which gives the city a veto option.
The Munich airport has grown a lot. Everybody who has been flying a bit knows that big airports (such as Heathrow) are often the worst. If anything goes wrong, you are busted, because it will take them a long time to resume operations. This just happened to me in Munich, where the luggage system was down and no luggage arrived at the airplanes.
Yet they want to take the airport further down this road and make it even bigger: add two satellite terminals and a third runway. I'm convinced that this will make the airport much worse for anybody around here. The security checkpoints will be even more crowded, the lines for the baggage drop-off too, and you will have to walk much further through the airport.
Up to now, the Munich airport has been pretty good compared to others, in particular given that it is one of the largest in Europe! That is because it was designed from the ground up for this size. Now they plan to screw it up.
But there are other arguments against this, beyond the egoistic view of a traveller. The first is the money issue. The airport is continuously making losses. It's the taxpayer that has to pay for all of this - and the current cost estimate is 1200 million. This is not appropriate, in particular since history shows that you can multiply such estimates by 2 to 10 to get the real number. They should first get the airport into a financially stable condition, then plan on making it even bigger.
Then there are the numbers. Just like with any large-scale political project, the numbers are all fake. The current airport was planned to cost 800 million; in the end it was about 8550 million. The politicians happily lie to us, because they want to push their pet projects. We must no longer accept such projects based on fake numbers and old predictions.
If you are already one of the 10 largest airports in Europe, can you really expect to grow even further?!? There is a natural limit to growth, unless you want every single passenger in the world to first travel to Munich multiple times before going on to their final destination ...
One thing they seem to have completely neglected is that Berlin is currently getting a brand new airport. And very likely, this is going to divert quite some traffic away from Munich, just like the Munich airport diverted a lot of traffic away from Frankfurt - to some extent because many people actually want to go to Berlin, not Munich, but currently have to change planes here or in Frankfurt. So when Berlin finally is operational, this will have an impact on Munich.
And speaking of the Berlin airport, it is a good example of why not to trust the numbers and our politicians. It is another way-over-budget, way-behind-schedule project that the politicians screwed up badly and lied to us about. If we should not have trusted them with Berlin, why should we trust them with the Munich expansion?
A lot of people whose families have been living there for years will have to be resettled. Whole towns are going to disappear. An airport is huge. Yet they cannot vote against it, because their small towns do not own shares of the airport. The politicians don't even talk to them, not even to their political representatives.
Last but not least, the airport is in a sensitive ecological area. The directly affected area is a European special protection area for wild birds. There are nature preserves nearby, and all of this area already suffers badly from airport drainage, pollution and noise. When they built the airport originally, the replacement areas they set up were badly done, and are mostly inhabited by nettles and goldenrod (which is not even native to Europe). See this article in the Süddeutsche Zeitung on the impact on nature. You can't replace the loss of the original habitats just by digging some pools and filling them with water ...
If you want more information, go to this page, by Bund Naturschutz.
This is not about progress ("Fortschritt"). That is a killer argument the politicians love, but it doesn't hold. Nobody is trying to shut down the airport. Munich will be better off by keeping the balance: having a reasonably sized airport (and in fact, the airport is already one of the 10 largest in Europe!) while preserving some nature that makes it worth living here.
If you are located in Munich, please go vote against the airport extension, and enjoy the DEN-GER soccer game afterwards. Thank you.

9 June 2012

Erich Schubert: DMOZ dying

Sometime in the late 1990s I became a DMOZ editor for my local area. At that time, when the internet was a niche thing and I was still a kid, I was actually operating a web site for a non-profit organization that had a similar goal to the corresponding category.
In the following years, I would occasionally log in and try to review some pages. It was a really scary experience: it was still exactly the same web 0.5 experience. You had a spreadsheet type of view, tons of buttons, and it would take like 10 page loads to review a single site. A lot of the time, you would end up searching for a more appropriate category, copying the URL, replacing some URL-encoded special characters, and pasting it into one out of 30 fields on the form, just to move the suggested site to a more appropriate category. Most of the edits would be by bots that detected a dead link and disabled it by moving it to the review stage. At the same time, every SEO manual said you need to be listed on DMOZ, so people would mass-submit all kinds of stuff to DMOZ in any category it could in any way fit into.
Then AOL announced DMOZ 2.0. And everybody probably thought: about time to refresh the UI and make everything more usable. But it didn't. First of all, it came late (announced in 2008, actually delivered sometime in 2010), then it was incredibly buggy in the beginning. They re-launched 2.0 at least twice. For quite some time, editors were unable to log in.
When DMOZ 2.0 came, my account was already "inactive", but I was able to get it re-activated. And it still looked the same. I read they changed from Times to Arial, and probably changed some CSS. But other than that, it was still about as complicated to edit links as you could make it. So I made just a few changes, then largely lost interest again.
During the last year I must have tried to give it another go multiple times. But my account had expired again, and I never got a reply to my reinstatement request.
A year ago, finally, Google Directory - the most prominent use of DMOZ/ODP data, although its users were totally unaware of it - was discontinued, too.
So by now, DMOZ seems to be as dead as it can get (they don't even bother to answer former contributors that want to get reinstated). The links are old, and if it weren't for the bots that disable dead sites, it would probably look like an internet graveyard. But this poses an interesting question: will someone come up with a working "web 2.0 social" take on the "directory" concept (I'm not talking about Digg and these classic "social bookmarking" dead ducks)? Something that strikes the right balance between, on the one hand, the web page admins (and the SEO gold diggers) being allowed to promote their sites (and keep the data accurate), and, on the other hand, crowd-sourcing the quality control, while also opening the data? To some extent, Facebook and Google+ can do this, but they're largely walled gardens. And they don't have real social quality assurance; money is key there.

31 May 2012

Russell Coker: Links May 2012

  • Vijay Kumar gave an interesting TED talk about autonomous UAVs [1]. His research is based on helicopters with 4 sets of blades, and his group has developed software to allow them to develop maps, fly in formation, and more.
  • Hadiyah wrote an interesting post about networking at TED 2012 [2]. It seems that giving every delegate the opportunity to have their bio posted is a good conference feature that others could copy.
  • Bruce Schneier wrote a good summary of the harm that post-9/11 airport security has caused [3].
  • Chris Neugebauer wrote an insightful post about the drinking culture at conferences, how it excludes people and distracts everyone from the educational purpose of the conference [4].
  • Matthew Wright wrote an informative article for Beyond Zero Emissions comparing current options for renewable power with the unproven plans for new nuclear and fossil fuel power plants [5].
  • The Free Universal Construction Kit is a set of design files to allow 3D printing of connectors between different types of construction kits (Lego, Fischer Technik, etc) [6].
  • Jay Bradner gave an interesting TED talk about the use of Open Source principles in cancer research [7]. He described his research into drugs which block cancer by converting certain types of cancer cell into normal cells, and how he shared that research to allow the drugs to be developed for clinical use as fast as possible.
  • Christopher Priest wrote an epic blog post roasting everyone currently associated with the Arthur C. Clarke awards; he took particular care to flame Charles Stross, who celebrated The Prestige of such a great flaming by releasing a t-shirt [8]. For a while I've been hoping that an author like Charles Stross would manage to make more money from t-shirt sales than from book sales. Charles is already publishing some of his work for free on the Internet, and it would be good if he could publish it all for free.
  • Erich Schubert wrote an interesting post about the utility and evolution of Facebook likes [9].
  • Richard Hartmann wrote an interesting summary of the problems with Google products that annoy him the most [10].
  • Sam Varghese wrote an insightful article about the political situation in China [11]. The part about the downside of allowing poorly educated people to vote seems to apply to the US as well.
  • Sociological Images has an article about the increased rate of Autism diagnosis as social contagion [12]. People who get their children diagnosed encourage others with similar children to do the same.
  • Vivek wrote a great little post about setting up WPA on Debian [13]. It was much easier than expected once I followed that post. Of course I probably could have read the documentation for ifupdown, but who reads docs when Google is available?
Related posts:
  1. Links March 2012 Washington's Blog has an informative summary of recent articles about...
  2. Links April 2012 Karen Tse gave an interesting TED talk about how to...
  3. Links February 2012 Sociological Images has an interesting article about the attempts to...

11 April 2012

Erich Schubert: Are likes still worth anything?

When Facebook became "the next big thing", "like" buttons popped up on various web sites. And of course "going viral" was the big thing everybody talked about, in particular SEO experts (or those who would like to be).
But things have changed. In particular Facebook has. In the beginning, any "like" would be announced in the newsfeed to all your friends. This was what allowed likes to go viral, when your friends re-liked the link. This is what made it attractive to have like buttons on your web pages. (Note that I'm not referring to "likes" of a single Facebook post; they are something quite different!)
Once everybody "knew" how important this was, everybody tried to make the most out of it - in particular scammers, viruses and SEO people. Every other day, some clickjacking application would flood Facebook with likes. Every backwater website was trying to get more audience by getting "liked". But at some point Facebook just stopped showing "likes". This is not bad. It is the obvious reaction when people get too annoyed by the constant "like spam". Facebook had to put an end to this.

But by now a "like" is pretty much worthless (in my opinion). Still, many people following "SEO tutorials" are all crazy about likes. Instead, we should reconsider whether we really want to slow down our site loading by having like buttons on every page. A like button is not as lightweight as you might think it is. It's a complex piece of JavaScript that tries to detect clickjacking attacks, and it in fact invades your users' privacy, to the point where, for example in Germany, it may even be illegal to use the Facebook like button on a web site.
In a few months, the SEO people will realize that "like"s are a fad now, and will likely all try to jump on the Google+ bandwagon. Google+ is probably not half as much a "dud" as many think it is (because their friends are still on Facebook, and because you cannot scribble birthday wishes on a wall in Google+). The point is that Google can actually use the "+1" likes to improve everyday search results. Google for something a friend liked, and it will show up higher in the search results, and Google will show the friend who recommended it. Facebook cannot do this, because it is not a search engine (well, you can use it for searching for people, although Ark probably is better at this, and nobody searches for people nearly as often as they do regular web searches). Unless they enter a strong partnership with Microsoft Bing or Yahoo, their "like"s can never be as important as Google's "+1" likes. So don't underestimate the Google+ strategy in the long run.
There are more points where Facebook is by now much less useful than it used to be. For example, event invitations. When Facebook was in full growth, you could essentially invite all your friends to your events. You could also use lists to organize your friends and invite only the appropriate subset, if you cared enough. The problem again was: nobody cared enough. Everybody would just invite all their friends, and you would end up getting "invitation spam" several times a day. So again Facebook had to change and limit the invitation capabilities. You can no longer invite all, or even just all on one particular list. There are some tools and tricks that can work around this to some extent, but once everybody uses them, Facebook will just have to cut it down even further.
Similarly, you might remember "superpoke" and all the "gift" applications. Facebook (and the app makers) probably made a fortune on them with premium pokes and gifts. But then this too reached a level that started to annoy the users, so they had to cut down the ability of applications to post to walls. And boom, this segment essentially imploded. I haven't seen numbers on Facebook gaming, and I figure that by doing some special setup for the games Facebook managed to keep them somewhat happy. But many will remember the time when the newsfeed was full of Farmville and Mafia Wars crap ... it just does not work that way any longer.

So when working with Facebook and such, you really need to be on the move. Right now it seems that groups and applications are more useful for getting that viral dream going. A couple of services such as Yahoo currently require you to install their app (which may then post to your wall on your behalf and get your personal information!) to follow a link shared this way, and can then actively encourage you to reshare. And messages sent to a "Facebook group" are more likely to reach people that aren't direct friends of yours. When friends actually "join" an event, this currently shows up in the news feed. But all of this can change with zero days' notice.
It will be interesting to see if Facebook can in the long run keep up with Google's ability to integrate the +1 likes into search results. It probably takes just a few success stories in the SEO community for getting +1s instead of Facebook likes to become the "next big thing" in SEO. Then Google just has to wait for them to virally spread +1 adoption. Google can wait - its Google Plus growth rates aren't bad, and they already have a working business model that doesn't rely on the extra growth - they are big already and make good profits.
Facebook, however, is walking a surprisingly thin line. It needs tight control over the amount of data shared (which is probably why it tries to do this with "magic"). People don't want to have the impression that Facebook is hiding something from them (although it is in fact suppressing a huge part of their friends' activity!), but they also don't want all this data spammed onto them. In particular, Facebook needs to give web publishers and app developers just the right amount of extra access to the users, while keeping the worst of the spam away from the users.

Independent of the technology and the actual products, it will be really interesting to see whether we manage to find some way to keep the balance right in "social" one-to-many communication. It's not Facebook's fault that many people "spam" all their friends with all their "data". Google's Circles probably isn't the final answer either. The reason why email still works rather well is probably that it makes one-to-one communication easier than one-to-many, that it isn't realtime, and that people expect you to put enough effort into composing your mails and choosing the right recipients for the message. Current "social" communication is pretty much posting everything to everyone you know, addressed "to whom it may concern". Much of it is in fact pretty non-personal or even non-social. We have definitely reached the point where more data is shared than is being read. Twitter is probably the most extreme example of a "write-only" medium. The average number of times a tweet is read by a human other than the original poster must be way below 1, and definitely much less than the average number of "followers".
So in the end, the answer may actually be a good automatic address book, with automatic groups and rich clients, to enable everybody to use email more efficiently. On the other hand, separating "serious" communication from "entertainment" communication may well be worth a separate communications channel, and email definitely is dated and has its spam problems.

15 March 2012

Erich Schubert: ELKI applying for GSoC 2012

I've submitted an organization application to the Google Summer of Code 2012 for ELKI - an open source data mining framework I'm working on.
ELKI
I hope I can get spots for 1-2 students to help implement additional state-of-the-art methods, to allow for even broader comparisons. Acceptance notification will be tomorrow. I have no idea how high our chances of getting accepted are. We're open source, and as part of the university we have a proven record of educating students (in fact, around two dozen have contributed to ELKI by now, although not all of this has been "released" yet). We're rather small compared to e.g. Gnome, Debian or Apache, and just a few years old. But I believe we are trying to fill an important gap in the interaction between research and open source: way too many algorithms get published in science but are never made available as source code to test and compare. This is where we try to step in: make many largely overlooked methods available. Of course we also have k-means (who hasn't?), but there is much more than just k-means! And there are plenty of methods often cited in the scientific literature, yet nobody seems to have a working implementation of them ...

While most people equate data mining with prediction and classification (both of which are actually more machine learning topics), ELKI is strong on cluster analysis and outlier detection as well as index structures. Plus, it is much more flexible than other frameworks. For example, we allow almost arbitrary combinations of algorithms and distance functions. So with our tool, you can easily test your own distance function by plugging it into various algorithms. Otherwise we could just have used Weka.
The other reason why we did not extend Weka (or use R) is that we did not just want to implement some algorithms, but to be able to study the effects of index structures on the algorithms. And in my opinion, this is the key difference between true data mining and ML, AI or "statistics". In ML and statistics, the key objective is the result quality, which is often rather easy to measure, too. For "full" data mining, one also needs to consider all the issues of actually managing the data and indexing it to accelerate computations. Plus, the prime objective is to discover something new that you did not know before.
Of course these things cannot be completely separated. You can, for example, discover patterns and rules in the data that will allow you to make good predictions. But data mining is not just prediction and classification!
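To illustrate the "plug in your own distance function" idea mentioned above, here is a minimal, self-contained Java sketch. It deliberately does not use ELKI's real class names or API - the Distance interface and the NearestNeighbor class below are made up for this example - it only shows the general design of keeping algorithms independent of the distance measure:

    // Hypothetical sketch, NOT ELKI's actual API: it only illustrates how an
    // algorithm can be written against a distance-function interface, so that
    // custom distance measures can be plugged in without changing the algorithm.
    import java.util.Arrays;
    import java.util.List;

    interface Distance<O> {
        double distance(O a, O b);
    }

    // A toy algorithm that depends only on the Distance interface.
    class NearestNeighbor<O> {
        private final Distance<O> dist;

        NearestNeighbor(Distance<O> dist) {
            this.dist = dist;
        }

        O find(O query, List<O> database) {
            O best = null;
            double bestDist = Double.POSITIVE_INFINITY;
            for (O candidate : database) {
                double d = dist.distance(query, candidate);
                if (d < bestDist) {
                    bestDist = d;
                    best = candidate;
                }
            }
            return best;
        }
    }

    public class PluggableDistanceDemo {
        public static void main(String[] args) {
            // Plug in a custom distance: Manhattan distance on double[] vectors.
            Distance<double[]> manhattan = (a, b) -> {
                double sum = 0;
                for (int i = 0; i < a.length; i++) {
                    sum += Math.abs(a[i] - b[i]);
                }
                return sum;
            };
            NearestNeighbor<double[]> nn = new NearestNeighbor<>(manhattan);
            double[] nearest = nn.find(new double[] { 0, 0 },
                    Arrays.asList(new double[] { 1, 2 }, new double[] { 3, 1 },
                            new double[] { 0.5, 0.5 }));
            System.out.println(Arrays.toString(nearest)); // prints [0.5, 0.5]
        }
    }

The point of this separation is that the same custom distance can then be handed to many different algorithms (clustering, outlier detection, nearest-neighbor search) without changing any of their code.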

The next version of ELKI, 0.5, will be released in early April. ELKI 0.4 is already available in Debian testing and Ubuntu. The next release focuses on comparing clustering results, including an experimental visualization of clustering differences, which will be presented at the ICDE 2012 conference.
The big to-do list holds a lot of stuff. In particular, ELKI is currently mostly useful for research, as it is too difficult to use for the average business user wanting to do data mining. You can think of it more as a prototyping system: try out what works for you, then implement that within your larger project in an integrated and optimized way. The other big thing that I'm unhappy with is the visualization speed. SVG is great for print export, but Batik can be really slow, and the XML DOM API isn't exactly "accessible" to students wanting to add new visualizations. Right now, it is all useful and okay for my kind of experiments, but it could be so much more if we could solve the speed (and memory) issues of the pure-SVG-based approach. I'd love to see something much faster here, but with an SVG export for editing and printing.

14 March 2012

Erich Schubert: Google Scholar, Mendeley and unreliable sources

Google Scholar and Mendeley need to do more quality control.
Take for example the article
A General Framework for Increasing the Robustness of PCA-Based Correlation Clustering Algorithms
Hans-Peter Kriegel, Peer Kröger, Erich Schubert, Arthur Zimek
Scientific and Statistical Database Management (SSDBM 2008)
(part of my diploma thesis).
Apparently, someone screwed up entering the data into Mendeley and added the editors to the authors. Now Google happily imported this data into Google Scholar, and keeps on reporting the authors incorrectly, too. Of course, many people will again import this incorrect data into their bibtex files, upload it to Mendeley and others...
Yet neither Google Scholar nor Mendeley has an option for reporting such an error. They don't even realize that SpringerLink - where the DOI points - might be the more reliable source.
On the contrary, Google Scholar just started suggesting to me that the editors are coauthors ...
They really need to add an option to fix such errors. There is nothing wrong with having errors in gathered data, but you need to have a way of fixing them.

25 January 2012

Russell Coker: SE Linux Status in Debian 2012-01

Since my last SE Linux in Debian status report [1] there have been some significant changes.
Policy
Last year I reported that the policy wasn't very usable; on the 18th of January I uploaded version 2:2.20110726-2 of the policy packages, which fixes many bugs. The policy should now be usable by most people for desktop operations and as a server. Part of the delay was that I wanted to include support for systemd, but as my work on systemd proceeded slowly and others didn't contribute policy I could use, I gave up and just released it. Systemd is still a priority for me and I plan to use it on all my systems when Wheezy is released.
Kernel
Some time between Debian kernel 3.0.0-2 and 3.1.0-1, support for an upstream change to the security module configuration was incorporated. Instead of using selinux=1 on the kernel command line to enable SE Linux support, the kernel option is now security=selinux. This change allows people to boot with security=tomoyo or security=apparmor if they wish. No support for Smack though. As the kernel silently ignores command-line parameters that it doesn't understand, there is no harm in having both selinux=1 and security=selinux on both older and newer kernels. So version 0.5.0 of selinux-basics now adds both kernel command-line options to the GRUB configuration when selinux-activate is run (a sketch of the resulting configuration follows at the end of this post). Also, when the package is upgraded it will search for selinux=1 in the GRUB configuration and, if it's there, add security=selinux. This gives users the functionality they expect: systems which have SE Linux activated will keep running SE Linux after a kernel upgrade or downgrade! Prior to updating selinux-basics, systems running Debian/Unstable won't work with SE Linux. As an aside, the postinst file for selinux-basics was last changed in 2006 (thanks Erich Schubert). This package is part of the new design of SE Linux in Debian and some bits of it haven't needed to be changed for 6 years! SE Linux isn't a new thing, it's been in production for a long time.
Audit
While the audit daemon isn't strictly a part of SE Linux (each can be used without the other), it seems that most of the time they are used together (in Debian at least). I have prepared an NMU of the new upstream version of audit and uploaded it to delayed/7. I want to get everything related to SE Linux up to date, or at least to versions comparable to Fedora. I also sent some of the Debian patches for auditd upstream, which should reduce the maintenance effort in future.
Libraries
There have been some NMUs of libraries that are part of SE Linux. Due to a combination of having confidence in the people doing the NMUs and not having much spare time, I have let them go through without review. I'm sure that I will notice soon enough if they don't work; my test systems exercise enough SE Linux functionality that it would be difficult to break things without me noticing.
Play Machine
I am now preparing a new SE Linux Play Machine running Debian/Unstable. I wore my Play Machine shirt at LCA, so I've got to get one going again soon. This is a good exercise of the strict features of SE Linux policy, and I've found some bugs which need to be fixed. Running Play Machines really helps improve the overall quality of SE Linux. Related posts:
  1. Status of SE Linux in Debian LCA 2009 This morning I gave a talk at the Security mini-conf...
  2. SE Linux in Debian I have now got a Debian Xen domU running the...
  3. Debian SE Linux Status At the moment I've got more time to work on...
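For concreteness, here is a minimal sketch of what the relevant line in /etc/default/grub could look like once both options are passed. This is only an illustration of the dual-option idea described above; the exact lines that selinux-activate writes may differ:

    # /etc/default/grub (sketch only - the exact output of selinux-activate may differ)
    # Older kernels understand selinux=1, newer ones use security=selinux.
    # Unknown parameters are silently ignored, so listing both is harmless.
    GRUB_CMDLINE_LINUX="selinux=1 security=selinux"

After changing this (or letting selinux-activate do it), running update-grub regenerates /boot/grub/grub.cfg so that the options take effect on the next boot.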

9 October 2011

Erich Schubert: Class management system

Dear Lazyweb.
A friend of mine is looking for a small web application to manage tiny classes (as in course, not as in computing). A class usually spans just four dates, and people will often sign up for the next class afterwards. Usually 10-20 people per class, although some might not sign up via the internet.
We deliberately don't want to require them to fully register for the web site and go through all that registration, email verification etc. trouble. Anything that takes more than filling out the obviously required form will just cause trouble.
At first this sounded like a common task, but in essence all the systems I've seen so far are totally overpowered for it. There are no grades, no working groups, no "customer relationship management". There isn't much more needed than the ability to easily configure the classes, have people book them, and get the list of signed-up users into a spreadsheet easily (CSV will do).
It must be able to run on a typical PHP+MySQL web host and be open source.
Any recommendations? Drop me a comment, or email me at erich () debian org. Thank you.

28 September 2011

Erich Schubert: Privacy in the public opinion

Many people in the United States seem to hold the opinion that the "public is willing to give up most of their privacy", in particular when dealing with online services such as Facebook. I believe that in his keynote at ECML-PKDD, Albert-László Barabási of Harvard University expressed such a view, that this data will just become more and more available. I'm not sure if it was him or someone else (I believe it was someone else) who essentially claimed that "privacy is irrelevant". Another popular opinion is that "it's only old people caring about privacy".
However, just like politics, these things tend to oscillate from one extreme to another. For example, in recent years in Europe conservative parties were winning one election after another. Now in France the socialists have just won the Senate, the conservative parties in Germany are losing in one state after the other, and so on. And this will change back again, too. Democracy also lives from changing roles in government, as this both drives progress and fights corruption.
We might be seeing the one extreme in the United States right now, where people readily give away their location and interests for free access to a web site. This can swing back any time.
In Germany, one of the governing parties - the liberal democrats, FDP - just dropped out of the Berlin state parliament, down to 1.8% of the vote. Yes, this is the party of the German foreign minister, Guido Westerwelle. The pirate party [en.wikipedia.org] - much of whose program is about privacy, civil rights, copyright reform and the internet, and which didn't even participate in the previous election since it was founded just 5 years ago - jumped to 8.9%, scoring higher than the liberal democrats did in the previous election. In 2009 they scored a surprisingly high 2% in the federal elections; current polls see them anywhere from 4% to 7% at the federal level, so they will probably get seats in parliament in 2013. (There are also other reasons why the liberal democrats have been losing voters so badly, though! Their current numbers indicate they might drop out of parliament in 2013.)
The Greens in Germany, who are also very much oriented towards privacy and civil rights, are also on the rise, and in March just became the second strongest party and senior partner in the governing coalition of Baden-Württemberg, which historically was a heartland of the conservatives.
So don't assume that privacy is irrelevant nowadays. Public opinion can swing quickly. In particular in democratic systems that have room for more than two parties - so probably not in the United States - such topics can actually influence elections a lot. Within 30 years of their founding, the Greens now frequently reach 20% in federal polls and up to 30% in some states. It doesn't look as if they are going to go away soon.
Also don't assume that it's just old people caring about privacy - in Germany, the pirate party and the Greens in particular are very much favored by young people. The typical voter for the pirates is less than 30 years old, male, has a higher education and works in the media or internet business.
In Germany, much of the protest for more privacy - and against the overly eager data collection by companies such as Facebook and Google - is driven by young internet users and workers. I believe this will be similar in other parts of Europe - there are pirate parties all over Europe. And this can happen in the United States any time, too.
Electronic freedom - pushed e.g. by the Electronic Frontier Foundation, but also by the open source movement - does have quite a history in the United States. But open source in particular has made such huge progress over the last decade that these movements in the US may just be a bit out of breath right now. I'm sure they will come back with a strong push against the privacy invasions we're seeing right now. And that can likely take down a giant like Facebook, too. So don't bet on people continuing to give up their privacy!
